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Previous years have seen increased interest in a new network paradigm, 
Software defined networking (SDN). The basic idea behind this new concept 
consists of removing the smart parts of the connectivity components and 
moving them to a seul control point known as a controller. This centralized 
network view makes the network maintenance and management easier and 
facilitates the creation of new services. Despite many advantages of SDN, the 
concentration of network intelligence in a single controller raises serious 
challenges that impact SDN scalability, performance, and fault tolerance. One 
of the main problems in SDN is controller failure. In this article, we develop 
a fault tolerant model called fault-tolerant load balancing (FTLBC) for SDN 
controllers. To reduce the cascading failure problem, the proposed model 
requires the load of the failing controller to be shared among other controllers. 
In the case of a controller failure, The FTLBC model concentrates on 
distributing the load among the remaining controllers based on the load of the 


orphans' switches and the load of the remaining controllers. 
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1. INTRODUCTION 

Software-defined networking (SDN) is a novel architecture in the networking domain that has recently 
appeared. The core concept of SDN is to combine the intelligent features of the connectivity equipment and 
direct them toward a single control point known as a controller [1]. The latter provides a single network view 
that simplifies network design and facilitates the creation of new services due to the programmability provided 
by the SDN [2]. With the northbound interface (NBI), the controller makes some functionality available to 
network operators to create a variety of applications using, for instance, a set of representational state transfer 
(REST) application programming interfaces (APIs) [3]. 

The SDN, therefore, offers several advantages, however, there are challenging issues that can hamper 
its performance in large data center networks. For instance, concentrating all network load in a single controller 
causes serious problems that affect the performance, security, and reliability, in addition to the SDN fault 
tolerance which leads to a single point of failure (SPOF) [4], [5]. In SDN networks, fault tolerance techniques 
are necessary to guarantee availability and dependability [2]. For the SDN architecture that recommends having 
one controller, the SPOF issue is more prominent. If the controller fails, SDN networks consistently lose packet 
transmission. Therefore, using multiple controllers might be effective to complete this challenge. While the 
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multi-controller architecture avoids a SPOF and thus improves the SDN reliability, the failure of a controller 
could cause network parallelization. When a controller crashes, the connected switches need to be migrated to 
a different controller. The charge of the failing controller must be split among other controllers to overcome 
the problem of cascading failure [6]. The SDN Load balancing while failure is an important point. Developing 
highly reliable, fault-tolerant, and load-balancing SDN controllers was the major challenge [7]. Load balancing 
takes intelligent actions and ensures better performance and reliability [8]. 

In this work, we address the challenge of fault tolerance and discuss the models used to solve it with 
an emphasis on fault tolerance in SDN architecture with multiple controllers. This paper's objective is to present 
a new fault-tolerant model that considers the load balancing between the SDN controllers. The proposed model 
is called fault-tolerant load balancing (FTLBC) for SDN controllers. In the FTLBC model, the load of the 
failing controller must be distributed to active controllers to reduce the cascading failure problem. If a controller 
has failed, the FTLBC model concentrates on distributing the load across the active controllers considering two 
metrics: the load of the orphan switches and the load of the active controllers. Compared to traditional SDN 
architecture, this FTLBC model can provide enhanced fault tolerance functionalities. The document is 
structured as shown in: section 2 outlines the proposed method along with the topology that was chosen for 
our FTLBC model, section 3 presents the method used for the recovery of our system, section 4 discusses the 
results, and section 5 concludes the paper. 


2. THE PROPOSED METHOD 
2.1. Related work 

To address the issue of SDN controller failures, the authors propose different solutions. As mentioned 
in the previous article [9], there are two types of solutions in the literature: solutions proposed for a single 
controller SDN architecture, and other solutions proposed for a multiple controller SDN architecture, in this 
article we will concentrate on the solutions proposed for a multiple controller SDN architecture. A comparison 
between the proposed approach and the existing approaches with respect to fault tolerance, and load balancing 
is shown in Table 1. 


Table 1. Comparison of the proposed approach with existing approaches 


Method Year Assignment of Metrics to select the Fault Cascading Load Consider the 
switches candidate controller tolerance failure balancing switches load in 
problem- the recovery 
solving process 
Hyperflow 2010 Local Distance v x x x 
[10] 
RDSDN [11] 2018 Global Distance v x x x 


Controller’ load = 
Number of links and 
switches. 
Topology 
failure of links. 
FT-SDN [12] 2020 Local Controller’ load v v x x 
latency 
LBFTFB [13] 2017 Local Assignment cost v v v x 
ECFT [14] 2018 Global Distance v v v x 
Controller! load= 
number of flows per 


second 

Packet loss 
CALB [15] 2019 Global Controller’ load v v v x 

Throughput 
FTLBC - Local Switch’ Load= total v v v v 
(proposed + number of flow table 
method) Global entries 

+ 


Controller’ Load 
=accumulation of the 
Switches Load 


Hyperflow [10] is a widely used fault-tolerant control model. Every controller manages the switches 
in its network and communicates with other controllers from other networks to synchronize the overall status 
of the network. This design for the control plane is logically centralized. In this model, each switch is connected 
to the nearest controller. In case of failure, a switch can be migrated to another controller. When a controller 
fails, the switches attached to this controller should be adjusted to connect to the closest controller by 
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considering the distance metric. The switch-controller reassignment is done locally. In Hyperflow, since it 
blindly selects a neighboring controller in case of failure, it does not offer a solution if the chosen controller is 
not operational which might lead to the cascading failure problem. 

Another approach improving the SDN fault tolerance is reliable distributed SDN (RDSDN) [11]. This 
solution involves adopting a distributed architecture with multiple controllers, where each controller is the 
master of one subnet and a slave of other subnetworks for the purpose of recovery. In case of RDSDN, the 
controller failure is identified by the coordinator, who decides on the alternate controller best suited to take 
over the subnetwork whose controller is failing. Each controller receives a reliability rate, which is shared 
across several controllers. The coordinator is determined by the controller with the highest reliability value. 
This approach is regarded as one of the most appropriate methods since it considers differents metrics to select 
the best controller to support the subnetwork whose controller has failed. However, assigning globally all the 
orphan switches to one controller, even if it is the most reliable controller, can make it overloaded, which can 
cause the problem of cascading failure. 

Although Hyperflow [10], and RDSDN [11] methods address the issue of SDN controller fault 
tolerance, otherwise, affecting all the orphaned switches to the nearby controller or even the most reliable 
controller can cause the problem of cascading failure. So these two methods can result in a cascading failure. 
To address the problem of cascading failure, several solutions were suggested in the literature. The objective 
of these solutions is to find a method that manages the failures of a controller without causing the failures of 
the others controllers. This can be achieved with load balancing between controllers. e.g. if a controller fails, 
the switches associated with this controller must be distributed among other controllers. 

fault-tolerant distributed architecture for software defined networks (FT-SDN) [12] is an architecture 
suggested by the authors to overcome the problem of fault tolerance in SDN. In the FT-SDN architecture, a 
switch is connected to multiple controllers, with one of the controllers acting as a master controller while the 
others remain as slave controllers. Once the master controller fails, one of the slave controllers assumes control 
of the switch and updates the network information base (NIB). This solution minimizes the problem of 
cascading failure since it distributes the switches of the crashed controller to the remaining controllers taking 
into account the load of the controllers and the latency. 

Load balancing to support fault tolerance using feedback control for SDNs (LBFTFB) [13] is another 
method proposed in the literature to provide fault tolerance in SDN networks. For each switch, all slave 
controllers are sorted in ascending order in terms of the assignment cost, the first ranked controller is selected 
as its master. All others are considered slaves. In case of failure of the master, the first slave (i.e. having the 
lowest assignment cost) is automatically used as a master. LBFTFB provides load balancing for several levels 
of backup controllers. The LBFTFB model also reduces the cascading failure problem effect. 

Enhanced controller fault tolerant (ECFT) [14] is a new method for managing controller failures. 
When a controller fails, the ECFT model concentrates on load balancing among other nearby controllers. The 
load of each neighboring controller is determined by ECFT using the delay between switches and the related 
controllers. ECFT chooses a master controller that is in charge of reassigning the orphans’ switches to an 
appropriate backup controller. Since the charge of the failing controller is shared between others controllers, 
this avoids the problem of cascading failure. 

Controller adaptive load balancing model (CALB) [15] is a new approach for load balancing that 
provides fault tolerance for the SDN controllers in case of failure. This method proposes a load balancing 
algorithm used to share the charge among the slave controllers. Acording to CALB, the load balancing consists 
for distributing the charge of the orphaned switches among the slave controllers. 

Since the objective of our FTLBC model is to find an efficient method that can handle controller failures 
and at the same time does not cause the cascading failure problem. for instance, in RDSDN [11] the authors 
propose a reliable method for controllers fault tolerance, but the fact of re-assigning all the orphaned switches to 
one controller, even if it is reliable can cause the problem of cascading failure. That's why we have proposed in 
our model a recovery method that also takes into consideration the load balancing between controllers. In this 
context, some works considers the load balancing between controllers in the failure recovery stage, for instance, 
the solutions FT-SDN [12], LBFTFB [13], ECFT [14], and CALB [15] minimize the problem of cascading failure 
since it distributes the orphaned switches to the active controllers taking into account the controllers’ load. All 
these approaches redistribute the orphans’ switches considering only the controllers’load, but none of these 
methods consider the load of the switches in the recovery process. In our proposed FTLBC model we take into 
account in addition to the controllers’load, the switches’load. Two criteria are used in our model: controllers’ load 
and switches’load to migrate the orphan switches to the active controllers. It is possible to optimize this 
reassignment locally or globally. In the local reassignment, switches are migrated individually, which increases 
latency. On the other hand, the global reassignment results when all the orphan switches can be migrated 
simultaneously to a new controller. In our FTLBC model, switch-controller migration is done locally and also 
globally. i.e. we can reassign at the same time one or more switches from the failing controller to other controllers. 
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2.2. Topology used for FTLBC model 

In SDN implementation there exist two types of architectures a centralized controller architecture and 
a multiple controller architecture as shown in Figure |. The centralized controller architecture can be seen in 
Figure 1(a) is simpler to manage but inherently unreliable [16]. Concentrating all network intelligence into a 
single controller raises significant issues that have an impact on scalability, performance, and the fault tolerance 
of the SDN, leading to a SPOF. When an SDN controller breaks, the switches lose the capacity to forward new 
packets and then the network crashes. 

The multiple controller architecture as shown in Figure 1(b) promises to solve the single SDN 
controller's shortcomings [17]. The basic concept is to have different controllers that can share the load equally 
across the network. Also, one controller can relay another controller if it fails [18]. Moreover, this solution is 
known to improve reliability and fault tolerance. In short, we can state that a multi-controller architecture is 
necessary while fault tolerance or availibility is the problem. However, the limitation of this solution is that it 
requires the synchronization overhead between the controllers for consistency reasons, the controllers must 
share their network information through the east or westbound interfaces. Since the objective of our FTLBC 
model is to enhance fault tolerance in controllers, we chose to use a multiple SDN controller architecture. 


C1C2Cn 


Figure 1. The centralized controller architecture; (a) with single controller and (b) the multiple controller 
architecture 


In a multiple SDN controller architecture, controller disposition is either hierarchical (vertical 
disposition) or fully distributed (flat disposition) [18] as shown in Figure 2. In a hierarchical model as shown 
in Figure 2(a), just the roots controller possesses the status of the global network and controls it. Some examples 
of hierarchical models are Kandoo [19], improved Kandoo [20], [21], and FlowBroker [22]. This architecture 
may not tolerate failures since it uses a root controller wish can be a SPOF. 

Conversely, in a fully-distributed model as shown in Figure 2(b), each controller controls a subnet of 
the network. Some examples of fully-distributed models are Onix [23], Open Network Operating System 
(ONOS) [24], distributed multi-domain SDN controllers (DISCO) [25], and ElastiCon [26]. In this architecture, 
the global network state is stored and maintained by the SDN controllers in a data store, which may result in a 
significant cost of synchronization to add the controllers, but it remains the most appropriate solution for 
improving fault tolerance and reliability. In our FTLBC model, we have chosen a flat architecture, since the 
hierarchical architecture can't help us to manage the controller's failures. 


2.3. Synchronization between controllers 

To create a global network state, the controllers must share topology information between them. The 
East or Westbound interfaces are introduced for SDN distributed architecture [27]. This API provides 
connectivity across controllers for the purpose of synchronization. For example, SDN controllers may use this 
API to communicate their local network status and create a global network status. 

When one of the SDN controllers breaks, active controllers, need to be prepared to assume the role of 
defaulting controller. Using the East or Westbound interfaces, the FTLBC model will transmit the essential 
information, including the name and the availability between controllers. Furthermore, an abstract overview of 
the network topology of each controller, comprising controller’ load and switch’ load, can be transmitted 
through this interface to offer a global view of the network. 
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Figure 2. A view of distributed controller architecture; (a) hierarchical controller(root) and (b) fully 
distributed controller (flat) 


3. METHOD 

The FTLBC model distributes the failing controller's switches to the other controllers, taking into 
account two metrics: Switches ' load and Controllers’ load. 

With a large scale of flow inputs, the controller handles a large flow table and the controller’s load 
will be high. The switches’ load includes a total number of flow table inputs, and the controllers' load is an 
accumulation of the load of the switches. According to the reciprocal relationship a shown in Table 2, the least 
loaded switches are mapped to the most heavily loaded controls and the most heavily loaded switches are 
mapped to the least loaded controls. 


Table 2. Reciprocal relation between the switch’s load and the controllers! load 


Metric 1 Relation Metric 2 
Switch Load = => ——— Controller Load 
Switch Load => ——> ___ Controller Load 


To explain the efficiency of our recovery method, we will take the example of the architecture shown 
in Figure 3. In this architecture we consider three controllers: C1, C2, and C3. Cl is the master controller of 
network1, C2 is the master controller of network2 and C3 is the master controller of network3. we assume that 
the network controlled by C1 (network1) is located in an industrial area; in this area, the load of the switches 
will be too high because the switches send a lot of requests to the controller C1 and also the activity of this 
controller will be high; we consider that the network controlled by C3 (network3) is located in a rural area; in 
this area, the load of the switches will be reduced because they send only a few requests to the C3 controller 
and also the activity of this controller will be reduced; and we assume that the network of the C2 controller 
(network2) is located in a normal area (between an industrial area and a rural area) where the load of the 
switches and the controller is normal (neither too high nor too low). 


C1 (Load :80%) C2 (Load :60%) C3 (Load :40%) 
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Network 1: Industrial Zone Network2: Normal zone Network3: rural zone 


Figure 3. The architecture used for the recovery process in FTLBC model 
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Considering that controller C2 has failed, the controller chosen as the leader (C1 or C3) will redistribute 
the load of C2 (the switches of C2) among other controllers, taking into consideration the load on C2's switches 
and the load on other controllers (C1 and C3). As it is shown in Figure 3, we assume that the load of C1, C2, and 
C3 are successively 80%, 60%, 40%, and the load of switches of C2: $4, S5, S6, and S7 are successively 10%, 
20%, 30%, 40%. if C2 has failed, we will assign the less loaded switches of C2 to the more loaded controller; that 
is to say in this case we will assign S4 and S5 to Cl, and we will assign the more loaded switches of C2 to the 
less loaded controller; that is to say, assign S6 and S7 to C3. So, in this case, if C2 has failed the switches S4 and 
S5 will be assigned to C1, and the switches S6 and S7 will be assigned to C3. 


4. RESULTS AND DISCUSSION 
As aresult, in this section we discuss the recovery algorithm in Algorithm 1| against controller failures. 

This approach is described in: 

= Firstly, the leader controller must store the active and unloaded controllers in a list L, and after order these 
controllers in decreasing order (lines 1-6 of Algorithm 1). 

- Afterward, the leader ordered the orphans' switches in a list T, in increasing order (lines 7-10 as shown in 
Algorithm 1). 

- Then assigns the variable q to the number of functioning controllers, and the variable step to integer 
division of the number of the orphans' switches n / on the number of active controller’s q (line 11 of 
Algorithm 1). 

- After, the leader checks if the number of the orphans' switches is lower than the number of active 
controllers, then assigns the variable step to 1 and modifies q (the size of the active controllers) to the 
number of the orphans' switches. In this case, we will assign the switches just to the first controllers (lines 
12-14 of Algorithm 1). 

- Afterward, go through the two lists: the list of active controllers (L) and the list of the orphans’ switches 
(T), and assign the switches to the controllers respecting the variable step that represents the number of 
switches to assign to each controller (line 15-21 of Algorithm 1). 

- The test:!(T. isEmpty () (line 16 of Algorithm 1) concerns the case when the number of switches n is higher 
than the number of controllers q. In this case, the remaining switch (s) in the list (T) will be assigned to the 
last controller. 


Algorithm 1. The recovery algorithm 

Input: -Number of network controllers (i = 1 to m), the Controllers designed by Ci 
-Number of the orphans' switches (j = 1 to n), the Switches designed by Si 
-L= []: an empty list to fill with controllers in a decreasing order 
-T= []: an empty list to fill with the orphans' switches in increasing order. 

Procedure: 

1: For i:1 to m do 

2: IF Ci isActive () && load(Ci) < capacity(Ci) Then 

3 : L.add (Ci, load(Ci)) 

4: End IF 

5: End For 

6: sortDescending (L) 

7S For jil ten. de 

8: T.add (Si, load(Si) ) 

9: End For 

sortAscending (T) 

k=1; q= length(L); step=n/q ; p=step ; //q is the number of active controllers 

IF (step==0) Then 

step=1; p=step; q=n; //numbers of switches (n)< numbers of controllers (q) 

End IF 

For i:1 to q do 

For j: k to step and !(T. isEmpty ()) do 

affecter T[j] a L[i] ; 

End For //end For 1 

k=j; 

step=k+p; 

: End For //end For 2 


FPOWMOMAATIAUOFPWNEF Oe 


NN 


To implement our recovery algorithm, we'll look at the topology in Figure 3 that has three SDN 
controllers in charge of various subnetwork topologies at internet protocol (IP) addresses C1=10.10.10.1, 
C2=10.10.10.2, and C3=10.10.10.3. To simulate the network, we use the network emulation tool Mininet [28]. 
The three controllers are emulated with the Floodlight controller [29]. Floodlight is a Java OpenFlow controller 
[30] widely used by developers. It provides a framework that can be used to easily enhance SDNs. 
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In this study, we install the Mininet tool and Floodlight controller in a virtual machine. The final 
results of our implementation are currently being processed. All experiments were performed on a computer 
with an Intel Core 15, 8250U 1.80 GHz CPU, and 16 GB RAM. Table 3 summarizes the tools used for our 
FTLBC model. 


Table 3. The simulation testbed’ tools 


Tool Feature Version 
Mininet [28] Network emulator 22.1 
Floodlight [29] SDN controller 12 
OpenFlow [30] Programmable network protocol 1.5 
Windows Operating system on PC Windows 10 64bit 
Oracle VM virtual box _ Virtualizing software 6.0 
Linux Virtual machine's operating system Ubuntu 16.04 64bit 
Hping3 Traffic Version 3 


5. CONCLUSION 

In this research, we design a new model called fault-tolerant load balancing for SDN controllers. In 
our approach, the load of the failing controller has to be shared across the remaining controllers to reduce the 
cascading failure problem. In case of a controller failure, the FTLBC model concentrates on distributing the 
load across remaining controllers taking into account two metrics: the load of orphans' switches and the load 
of other controllers. Compared to the traditional SDN architecture, the FTLBC model offers enhanced fault 
tolerance functionalities and better management of independent controllers. As further research for this work, 
we will test the reliability of our FTLBC model compared to other models that exist in the literature. And also, 
we would like to develop a more sophisticated model that uses artificial intelligence techniques. 
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